This notebook contains notes and beliefs about several commonly-used supervised learning algorithms. My dream is that it will be useful as a quick reference, or for people anxiously cramming for machine learning interviews.
After some setup code, the methods discussed are:
- logistic regression
- decision trees (and random forests)
- support vector machines
- K nearest neighbors
- naive Bayes
To better understand each classifier, we train it on several versions of the "two moons" dataset and plot the empirical decision boundaries. Each plot shows the training data on top of a few thousand randomly chosen points that have been colored by the output of the learned model. Superstition #1: the plots suggest that linear classifiers are often outperformed on high-quality training sets, but still produce sane results on small, noisy datasets. Note: not all of the plots have the same axis limits.
import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.datasets
%matplotlib inline
# sklearn's two-moons generator emits a lot of DeprecationWarnings; silence them
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
"""
Build some datasets that I'll demo the models on
"""
Xs = []
ys = []
#low noise, plenty of samples, should be easy
X0, y0 = sklearn.datasets.make_moons(n_samples=1000, noise=.05)
Xs.append(X0)
ys.append(y0)
#more noise, plenty of samples
X1, y1 = sklearn.datasets.make_moons(n_samples=1000, noise=.3)
Xs.append(X1)
ys.append(y1)
#less noise, few samples
X2, y2 = sklearn.datasets.make_moons(n_samples=200, noise=.05)
Xs.append(X2)
ys.append(y2)
#more noise, fewer samples, should be hard
X3, y3 = sklearn.datasets.make_moons(n_samples=200, noise=.3)
Xs.append(X3)
ys.append(y3)
def plotter(model, X, Y, ax, npts=10000):
    """
    Simple way to get a visualization of the decision boundary:
    apply the model to randomly chosen points in the bounding box
    of X and color them by the predicted class. Could alternatively
    use sklearn's "decision_function" on a regular grid.
    """
    # sample npts points uniformly from the bounding box of the data
    x0min, x0max = X[:, 0].min(), X[:, 0].max()
    x1min, x1max = X[:, 1].min(), X[:, 1].max()
    xs = np.random.rand(npts) * (x0max - x0min) + x0min
    ys = np.random.rand(npts) * (x1max - x1min) + x1min
    # predict expects a 2D array: one row per point
    cs = model.predict(np.column_stack([xs, ys]))
    ax.scatter(xs, ys, c=cs, alpha=.35)
    # overlay the training data (scatter calls overlay by default, so
    # the deprecated ax.hold(True) is unnecessary)
    ax.scatter(X[:, 0], X[:, 1],
               c=['r' if label else 'lime' for label in Y],
               linewidth=0, s=25, alpha=1)
    ax.set_xlim([x0min, x0max])
    ax.set_ylim([x1min, x1max])
Logistic regression is the canonical example of a "discriminative" classifier (i.e. one that learns the mapping $f:X \rightarrow Y$ directly from the data, as opposed to learning a model of how the data was generated). Here, $Y$ is categorical and $X$ may be either continuous or categorical.
Logistic regression assumes that $f$ is a sigmoid function of an inner product with the features, e.g.
$$ P(Y=1 | X; \theta) = \sigma(\theta^T X) = \frac{1}{1+\exp(-\theta^T X)} $$Training a logistic regression classifier amounts to learning the parameter vector, $\theta$. The simplest approach is maximum likelihood estimation, which leads to a convex, unconstrained optimization:
$$ \hat{\theta} = \arg\min_\theta \sum_{i=1}^M -\log P(Y_i | X_i; \theta) $$The MLE approach may suffer from overfitting. One can reduce overfitting by introducing a regularization term and a user-defined parameter, $\beta$, which controls the strength of the regularization. The standard approach assumes a Laplacian prior, $P(\theta) = (\beta/2)^N \exp(-\beta \|\theta\|_1)$, and takes the MAP estimate [5]:
$$ \hat{\theta} = \arg\min_\theta \sum_{i=1}^M -\log P(Y_i | X_i; \theta) + \beta \|\theta\|_{1 \text{ or } 2} $$Regularizing with $L_1$ favors a sparser $\theta$ (i.e. a simpler model) at the expense of a more difficult optimization problem.
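To make the MLE optimization concrete, here is a minimal from-scratch sketch using plain gradient descent on the negative log-likelihood (no regularization; the learning rate and iteration count are arbitrary choices for illustration, not tuned values):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Minimize sum_i -log P(y_i | x_i; theta) by gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(np.dot(X, theta))        # P(Y=1 | X; theta)
        grad = np.dot(X.T, p - y) / len(y)   # gradient of the mean negative log-likelihood
        theta -= lr * grad
    return theta

# e.g. on the low-noise moons (prepend a column of ones to act as the bias term)
theta = fit_logistic(np.column_stack([np.ones(len(X0)), X0]), y0)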
[1] http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf
[2] https://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
[3] https://cs.stanford.edu/people/ang//papers/nips01-discriminativegenerative.pdf
[5] https://cs.stanford.edu/people/ang//papers/aaai06-efficientL1logisticregression.pdf
[6] https://spark.apache.org/docs/1.1.0/mllib-linear-methods.html
[7] http://www.cims.nyu.edu/~tygert/lr.pdf
[8] http://math.arizona.edu/~hzhang/math574m/2014Lect6_LDAlog.pdf
[9] http://webdocs.cs.ualberta.ca/~greiner/C-466/SLIDES/3b-Regression.pdf
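In sklearn the regularization strength is controlled by `C`, which is (roughly) the inverse of the $\beta$ above, and `penalty` selects $L_1$ or $L_2$. A quick sketch (the `C` values here are arbitrary):
import sklearn.linear_model

# smaller C means stronger regularization (C is roughly 1/beta)
l2_clf = sklearn.linear_model.LogisticRegression(penalty='l2', C=1.0)
# the L1 penalty needs a solver that supports it, e.g. liblinear
l1_clf = sklearn.linear_model.LogisticRegression(penalty='l1', C=1.0,
                                                 solver='liblinear')
l1_clf.fit(X1, y1)
print(l1_clf.coef_)  # L1 tends to zero out uninformative coefficients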
import sklearn.linear_model

classifier = sklearn.linear_model.LogisticRegression()
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(11, 13))
for i, (X, y) in enumerate(zip(Xs, ys)):
    classifier.fit(X, y)
    plotter(classifier, X, y, ax=axes[i // 2, i % 2])
plt.show()
Decision trees stand in contrast to logistic regression in a few ways. They are nonlinear, nonparametric, and produce models which are simple to interpret (basically a flowchart). They perform well when given large amounts of clean data and really shine when the data is not linearly separable. Decision trees are often cited as an example of a method that is susceptible to overfitting (neural networks being another common example).
The learned structure resembles a k-d tree: axis-aligned splits recursively partition the feature space. Applying the classifier to a new point is just a root-to-leaf traversal of the tree.
Tall trees tend to indicate overfitting; pruning can help. Another way to reduce overfitting is to use a "random forest": train multiple decision trees, then build a classifier which applies all the trees to a data point and takes the mode of their predictions as the result. There are two steps in building a random forest [2]:
- train each tree on a bootstrap sample (a random sample, drawn with replacement) of the training data ("bagging"), and
- at each candidate split, consider only a random subset of the features.
[1] http://scikit-learn.org/stable/modules/tree.html#tree
[2] http://blog.echen.me/2011/03/14/laymans-introduction-to-random-forests/
[3] https://mahout.apache.org/users/classification/partial-implementation.html
[4] https://spark.apache.org/docs/1.2.0/mllib-decision-tree.html
[5] https://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36296.pdf
[6] http://web.engr.oregonstate.edu/~xfern/classes/cs534/notes/decision-tree-7-11.pdf
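Swapping the single tree below for a random forest is a one-liner in sklearn; `n_estimators` (the number of trees) is the main knob, and 100 is an arbitrary choice here:
import sklearn.ensemble

# 100 trees, each fit on a bootstrap sample of the data, with a random
# subset of the features considered at each split
forest = sklearn.ensemble.RandomForestClassifier(n_estimators=100)
forest.fit(X3, y3)  # the noisy, small dataset, where a single tree tends to overfit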
import sklearn.tree

classifier = sklearn.tree.DecisionTreeClassifier()
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(11, 13))
for i, (X, y) in enumerate(zip(Xs, ys)):
    classifier.fit(X, y)
    plotter(classifier, X, y, ax=axes[i // 2, i % 2])
plt.show()
Support vector machines are non-probabilistic, discriminative, linear classifiers. They work by finding the separating hyperplane with the maximum margin, i.e. the largest distance to the nearest training point of either class. With the commonly-used "kernel trick" (mapping all the points to a higher-dimensional space, finding the maximally-separating hyperplane there, and then mapping back), they can identify nonlinear decision boundaries.
One-class support vector machines are useful for finding outliers or for change-point detection [3].
The (grossly oversimplified) idea is as follows. There is a theoretical bound that says something along the lines of:
$$ \text{Test Error} \leq \text{Training Error} + \text{Complexity of set of Models} $$
where "Complexity of set of Models" is measured by something like the VC dimension [6]. Intuitively, a simple model drags down the second term but tends to have large training error, while a complex model increases the second term but reduces training error (i.e. the bound limits overfitting).
So, we want a model that minimizes the sum of the "training error" and the "complexity". Idea: learn a decision function, $f$, by solving
$$ \min_f \sum_i \text{loss}(f(x_i), y_i) + \text{Complexity} $$Use a separating hyperplane for $f$, i.e. $f(x_i) = \theta^T x_i + b$, with a loss on its outputs supplying the "training error" term. Use a norm and a user-defined parameter to regularize, i.e. $C\|\theta\|_{1 \text{ or } 2}$ (the "complexity" term). Note: some authors have found that an $L_1$ term makes the method less stable [1].
If a linear boundary is not enough, apply the kernel trick: represent all the points in a higher-dimensional space (e.g. by expanding in a basis of radial basis functions) and run the optimization in that space.
[1] http://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf
[2] http://scikit-learn.org/stable/modules/svm.html
[3] https://rvlasveld.github.io/blog/2013/07/12/introduction-to-one-class-support-vector-machines/
[4] https://spark.apache.org/docs/latest/mllib-guide.html
[5] http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
[6] http://www.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf
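sklearn's `SVC` defaults to an RBF kernel, which is what the plots below use; switching kernels (and the penalty parameter `C`, set arbitrarily to 1.0 here) is just a keyword argument. A quick sketch:
import sklearn.svm

linear_svm = sklearn.svm.SVC(kernel='linear', C=1.0)  # plain maximum-margin hyperplane
rbf_svm = sklearn.svm.SVC(kernel='rbf', C=1.0)        # kernel trick with radial basis functions
linear_svm.fit(X0, y0)
rbf_svm.fit(X0, y0)
# the moons are not linearly separable, so the linear kernel should score lower
print(linear_svm.score(X0, y0), rbf_svm.score(X0, y0))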
import sklearn.svm

classifier = sklearn.svm.SVC()
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(11, 13))
for i, (X, y) in enumerate(zip(Xs, ys)):
    classifier.fit(X, y)
    plotter(classifier, X, y, ax=axes[i // 2, i % 2])
plt.show()
"Nearest neighbor" is a dead simple non-parametric classifier which assigns to an unseen point, $x$, the value of the training point that it is closest to. "$K$ nearest neighbors" generalizes this idea to $K$ points, i.e. find the $K$ nearest training points to $x$ and use their majority vote to decide on a label for $x$. Larger $K$ leads to smoother decision boundaries.
[1] http://www.cs.ubc.ca/~murphyk/Teaching/CS340-Fall07/L4_knn.pdf
[2] https://en.wikipedia.org/wiki/Locality-sensitive_hashing
[3] http://micvog.com/2013/09/08/storm-first-story-detection/
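To see the smoothing effect of $K$, here is a minimal sketch sweeping sklearn's `n_neighbors` (its name for $K$; the values chosen are arbitrary) on the noisy 1000-sample moons:
import sklearn.neighbors

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))
for ax, k in zip(axes, [1, 5, 50]):
    knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=k)
    knn.fit(X1, y1)
    plotter(knn, X1, y1, ax=ax)
    ax.set_title("K = %d" % k)  # larger K should give a smoother boundary
plt.show()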
import sklearn.neighbors

classifier = sklearn.neighbors.KNeighborsClassifier()
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(11, 13))
for i, (X, y) in enumerate(zip(Xs, ys)):
    classifier.fit(X, y)
    plotter(classifier, X, y, ax=axes[i // 2, i % 2])
plt.show()
Naive Bayes is the canonical example of a "generative" classifier, i.e. one that first learns a model for randomly generating the training data and then uses the learned probability distributions to classify new points. There are several highly-cited comparisons between naive Bayes and logistic regression [2][3].
Deriving naive Bayes isn't completely insane. Just use Bayes' rule to start, expand the normalization in the denominator, and then invoke conditional independence. For example, to learn the probability of label $Y$ given features $X = (X_1, \ldots, X_n)$ you can do:
\begin{eqnarray} P(Y=y_k|X) &=& \frac{P(Y=y_k)P(X|Y=y_k)}{P(X)}\\ &=& \frac{P(Y=y_k)P(X|Y=y_k)}{\sum_y P(Y=y)P(X|Y=y)}\\ &=& \frac{P(Y=y_k)P(X_1 \ldots X_n|Y=y_k)}{\sum_y P(Y=y)P(X_1 \ldots X_n|Y=y)}\\ &=& \frac{P(Y=y_k) \prod_i P(X_i|Y=y_k)}{\sum_y P(Y=y) \prod_i P(X_i|Y=y)} \end{eqnarray}To make the last step work (factoring the joint likelihood into a product) we needed to assume "conditional independence". From Wikipedia:
R and B are conditionally independent given Y if and only if, given knowledge that Y occurs, knowledge of whether R occurs provides no information on the likelihood of B occurring, and knowledge of whether B occurs provides no information on the likelihood of R occurring.
Think of it like this: suppose you are building a classifier to identify if a particular tweet is about professional wrestling or a superhero movie. Given that a tweet is about wrestling, the knowledge that it contains the term "Hulk" provides no information on how likely it is to contain the term "Hollywood".
Computing the probabilities $P(X_i|Y=y)$ is where the "parametric" part of this comes into play. Typically one uses a Gaussian if the data is continuous and a multinomial if the data is discrete (e.g. text). Here, the choice of priors plays the same role that the choice of regularization plays in logistic regression.
[1] http://tuvalu.santafe.edu/~aaronc/courses/5352/fall2013/csci5352_2013_L16.pdf
[2] http://stackoverflow.com/questions/19129141/naive-bayes-and-logistic-regression-error-rate
[3] https://cs.stanford.edu/people/ang//papers/nips01-discriminativegenerative.pdf
[4] http://nlp.stanford.edu/IR-book/html/htmledition/linear-versus-nonlinear-classifiers-1.html
[5] http://scikit-learn.org/stable/modules/naive_bayes.html
[6] http://www.cs.unb.ca/profs/hzhang/publications/FLAIRS04ZhangH.pdf
[7] http://www.cs.colorado.edu/~jbg/teaching/DATA_DIGGING/lecture_05.pdf
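As a sanity check on the derivation above, here is a minimal from-scratch Gaussian naive Bayes sketch; it mirrors (in simplified form) what `GaussianNB` in the cell below estimates:
import numpy as np

def gaussian_nb_predict(Xtrain, ytrain, xnew):
    """Score each class by log P(Y=y) + sum_i log P(X_i | Y=y), take the argmax."""
    classes = np.unique(ytrain)
    scores = []
    for c in classes:
        Xc = Xtrain[ytrain == c]
        log_prior = np.log(float(len(Xc)) / len(Xtrain))   # log P(Y=c)
        mu, var = Xc.mean(axis=0), Xc.var(axis=0)
        # per-feature Gaussian log-likelihoods, summed thanks to conditional independence
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (xnew - mu) ** 2 / var)
        scores.append(log_prior + log_lik)
    return classes[np.argmax(scores)]

print(gaussian_nb_predict(X0, y0, X0[0]), y0[0])  # these should usually agree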
import sklearn.naive_bayes

classifier = sklearn.naive_bayes.GaussianNB()
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(11, 13))
for i, (X, y) in enumerate(zip(Xs, ys)):
    classifier.fit(X, y)
    plotter(classifier, X, y, ax=axes[i // 2, i % 2])
plt.show()